feat(transcode): r2 fixes — typed Arrow + codec_route + partial writes + CachedOntology + route validation#310
Conversation
partial writes + CachedOntology + route validation Addresses the five concrete gaps the brutally-honest review on #309 called out: 1. arrow_type_for_semantic no longer collapses everything to Utf8. Currency → Float32, Date(_) → Date32, CustomerId / InvoiceNumber → UInt64. DataFusion can now do real numeric / temporal predicate pushdown on those columns. The remaining semantic types stay Utf8 by deliberate choice — round 3 may pivot specific ones (Geo → struct{lat,lon}) when a consumer asks. 2. CachedOntology helper extracted upstream. Bundles an Arc<Ontology> with eagerly-projected DTOs per Locale (De/En). Prevents the per-call OntologyDto::from_ontology rebuild that medcare-rs's MedcareOntology and smb-office-rs's session ontology both grew independently. One implementation, one bug surface. 3. validate_route(route, ontology) added to parallelbetrieb. Parses /api/{entity_type}/{...} and asserts entity_type resolves to a declared Schema.name (case-insensitive). 4 tests covering accept-valid, reject-typo, reject-missing-prefix, reject-empty. Used as a pre-flight check for static route lists; the runner itself doesn't gate on it because typo routes are still genuine drift telemetry. 4. from_columns_partial added for PATCH-style upserts. Allows omitted Optional / Free columns (filled with Arrow null arrays); still rejects missing Required columns and undeclared columns. The existing from_columns keeps its strict full-row contract. 5. route_for_column now reads OuterColumn.codec_route directly, copied from the upstream PropertySpec.codec_route at schema derivation. Round-1 went through route_tensor which is calibrated for model-weight tensor names (q_proj, lm_head, ...) and would silently mis-classify document predicates. The contract's own field is now the source of truth — drift-by- construction impossible. Adds 7 new tests (6 in transcode/ + 1 implicit in cam_pq_decode where the existing test was rewritten to assert the new semantics): - cached_ontology_projects_every_locale_at_construction - cached_ontology_clones_are_arc_cheap - cached_ontology_inner_round_trips - validate_route_accepts_known_entity_type - validate_route_rejects_typo_entity_type - validate_route_rejects_missing_api_prefix - validate_route_rejects_empty_entity_segment - route_for_required_scalar_column_uses_property_spec_default (rewritten from route_for_scalar_columns_skips_codec; old test asserted the wrong thing — required scalars should default to Passthrough, not Skip) Verified: cargo check across {default, query-lite, query-lite+audit-log+ auth-rls-lite} — clean cargo test transcode:: → 33/33 passed (was 26/26 in #309; +7 new + 1 rewritten) cargo clippy on the same combos — zero transcode warnings rustfmt --check — clean across all 4 modified files
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ee50cc4d3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fn arrow_type_for_semantic(st: &SemanticType) -> ArrowTypeCode { | ||
| match st { | ||
| SemanticType::Currency(_) => ArrowTypeCode::Float32, | ||
| SemanticType::Date(_) => ArrowTypeCode::Date32, |
There was a problem hiding this comment.
Map DateTime semantic types to timestamp-capable Arrow type
arrow_type_for_semantic now sends every SemanticType::Date(_) to Date32, but SemanticType::Date includes DatePrecision::DateTime; this collapses all intra-day values to whole days and makes time-of-day predicates impossible for those fields. Any ontology property marked as Date(DateTime) will either lose precision at write time or be forced into an incompatible column representation.
Useful? React with 👍 / 👎.
| /// `OuterColumn.semantic_type` field. | ||
| fn arrow_type_for_semantic(st: &SemanticType) -> ArrowTypeCode { | ||
| match st { | ||
| SemanticType::Currency(_) => ArrowTypeCode::Float32, |
There was a problem hiding this comment.
Preserve monetary precision for currency columns
Mapping SemanticType::Currency(_) to Float32 introduces rounding for financial amounts (especially at larger magnitudes), which can change cents-level values and produce unstable equality/aggregation behavior. This is a regression from lossless string transport and can corrupt financial semantics in consumers that rely on exact totals or exact-match filters.
Useful? React with 👍 / 👎.
Summary
Round-2 follow-up to #309 addressing all five concrete gaps the brutally-honest review called out.
arrow_type_for_semanticcollapsed every semantic type toUtf8— DataFusion couldn't doWHERE amount > 1000.0predicate pushdownCurrency→Float32,Date(_)→Date32,CustomerId/InvoiceNumber→UInt64Ontologypattern — medcare-rs and smb-office-rs both grew their own*Ontologycache independentlyCachedOntologyupstream that bundlesArc<Ontology>+ eagerly-projected DTOs per localevalidate_route(route, ontology) -> Result<(), String>+ 4 testsfrom_columnsrequires every declared column — broke PATCH-style upsertsfrom_columns_partialallows missing Optional/Free; still rejects missing Required + undeclaredroute_for_columnusedroute_tensor(calibrated for model-weight tensor names) — silently mis-classified document predicatesOuterColumn.codec_routedirectly, copied from the upstreamPropertySpec.codec_routeWhy each fix matters
#1 — Real predicate pushdown
The reviewer was right that this is the single biggest functional gap in #309. With
Currency → Float32, MedCare's lab-resultvaluecolumn can finally takeWHERE value > 5.0through DataFusion's scan filter rather than a string compare. Three semantic types covered (matching the reviewer's "at minimum" suggestion); the rest stayUtf8by deliberate choice — round 3 can pivot specific ones (e.g.Geo→ struct{lat, lon}) when a consumer actually asks.#2 —
CachedOntologyupstreamBoth consumer crates had to build their own caching wrapper. One canonical implementation here, with
Arc<OntologyDto>clones that are pointer-equal across calls.#5 — Drift-by-construction-impossible
This is the design correctness fix.
PropertySpecalready carriescodec_route. Round-1's heuristic viaroute_tensorwas calibrated for model-weight tensor names and would silently mis-classify document predicates. Round-2 reads the contract's own field. The transcode layer can never disagree with the schema author's intent.#3 — Route-validation tests
4 tests covering: accept-valid (
/api/patient/42), reject-typo (/api/patient_typo/42), reject-missing-prefix (/v1/patient/42), reject-empty (/api/).#4 — Partial-row PATCH
Optional/Free columns may be omitted (filled with Arrow null arrays via
arrow::array::new_null_array). Required columns still mandatory. Undeclared columns still rejected. The strictfrom_columnskeeps its full-row semantics;from_columns_partialis the new entry point for PATCH callers.What's still deferred
The reviewer listed two more items as round-2 lifts that this PR does not address:
.claude/plans/sql-spo-ontology-bridge-v1.md.zerocopy.rsstill only goes outer → Arrow. The reverse path (BindSpace columns → external SoA batch) needs the producer-side accessor incognitive-shader-driverfirst.Both are honestly tagged in the original module docs and should land incrementally.
Test scoreboard
Verification
cargo check: clean across 3 feature combos —default,query-lite,query-lite + audit-log + auth-rls-litecargo test transcode::: 33 passed, 0 failedcargo clippy: zero transcode warnings across the same combosrustfmt --check: clean across all 4 modified filesFiles changed
crates/lance-graph-callcenter/src/transcode/mod.rs—+CachedOntology+ 3 new testscrates/lance-graph-callcenter/src/transcode/zerocopy.rs— typed semantic-type → Arrow mapping;+from_columns_partial;+OwnedColumn::Date32;+OuterColumn.codec_routefieldcrates/lance-graph-callcenter/src/transcode/cam_pq_decode.rs—route_for_columnreadsOuterColumn.codec_route; rewritten testcrates/lance-graph-callcenter/src/transcode/parallelbetrieb.rs—+validate_route+ 4 new testsGenerated by Claude Code
Generated by Claude Code